Allow configuring collection of statistics during TPC-H benchmarks #3889
Conversation
There are currently only two physical plans that differ when we pass the flag. I'll try to inspect other plans to see what else we can do (or whether there is anything obvious we are missing, except filters, which are in progress in #3868). But I'd expect the majority of the speed-ups here to come from global join ordering (unlike the only-local ordering we have today) with #3843, so no high hopes for a magic bullet just yet 😇
@@ -98,6 +98,10 @@ struct DataFusionBenchmarkOpt {
    /// Path to output directory where JSON summary file should be written to
    #[structopt(parse(from_os_str), short = "o", long = "output")]
    output_path: Option<PathBuf>,

    /// Whether to disable collection of statistics (and cost based optimizations) or not.
    #[structopt(short = "S", long = "disable-statistics")]
It might be good to keep defaults the same as datafusion-cli (statistics disabled)?
I was also thinking about it but then decided against it, since the default is already true. Do you know whether we have any places where we continuously run benchmarks where making the default false would change results? If it wouldn't cause a regression in places where we compare benchmarks across DataFusion commits/versions, I think it should be an easy change.
There is currently no "continuous benchmarking" in place for this. If there were, it would make sense to run the benchmarks both with and without collecting statistics.
Anyway, good that we can enable / disable them 👍
Thanks @isidentical
Thanks for resolving the conflicts @Dandandan 🙏🏻 By the way, just as a question (or maybe something we might be interested in discussing in general): are there any plans/efforts to do continuous benchmarking, to ensure that no regression goes unnoticed? (I was very interested in @andygrove's benchmarking of Ballista/Spark, so maybe we could even do it for both Ballista and DataFusion.)
Which issue does this PR close?
Closes #3888.
Rationale for this change
As described in #3888, it is useful to see a benchmark with/without statistics to determine what sort of effect it will have.
What changes are included in this PR?
A new flag, --disable-statistics, to turn off statistics collection.
Are there any user-facing changes?
No.
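Since the flag is phrased negatively ("disable"), the benchmark code presumably inverts it to obtain the positive "collect statistics" setting that the execution config expects. A minimal, dependency-free sketch of that inversion (the function and argument handling here are illustrative, not the PR's actual structopt-based code):

```rust
/// Illustrative helper: returns whether statistics should be collected,
/// given raw CLI arguments. The real code uses structopt, which derives
/// this boolean from the `--disable-statistics` / `-S` flag automatically.
fn collect_statistics(args: &[&str]) -> bool {
    // The flag disables collection, so its absence means "collect".
    !args
        .iter()
        .any(|a| *a == "--disable-statistics" || *a == "-S")
}

fn main() {
    // No flag: statistics are collected (the default stays unchanged).
    assert!(collect_statistics(&["tpch", "--query", "5"]));
    // Flag present: collection (and cost-based optimizations) is skipped.
    assert!(!collect_statistics(&["tpch", "--disable-statistics"]));
}
```

Keeping the CLI flag negated means the default behavior (statistics enabled) is unchanged for anyone comparing benchmark results across commits, which is the concern raised in the review discussion above.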